Abstract

MALINA is a web service for bioinformatic analysis of whole-genome metagenomic data obtained from human gut
microbiota sequencing. As input, it accepts metagenomic reads from various sequencing technologies, including
long reads (such as Sanger and 454 sequencing) and next-generation reads (including SOLiD and Illumina). To the
authors' knowledge, it is the first metagenomic web service capable of processing SOLiD color-space reads. The web
service performs phylogenetic and functional profiling of metagenomic samples using the coverage depth resulting from
alignment of the reads to a catalogue of reference sequences, which is built into the pipeline and contains
prevalent microbial genomes and genes of the human gut microbiota. The resulting metagenomic composition
vectors are processed by a statistical analysis and visualization module containing methods for clustering,
dimension reduction and group comparison. Additionally, the MALINA database includes vectors of bacterial and
functional composition for human gut microbiota samples from a large number of existing studies, namely datasets
from the Russian Metagenome project, MetaHIT and the Human Microbiome Project (downloaded from http://hmpdacc.org),
allowing their comparative analysis together with user samples. MALINA is freely available on the web at
http://malina.metagenome.ru. The website is implemented in JavaScript (using Ext JS), Microsoft .NET Framework,
MS SQL and Python, and supports all major browsers.

Background

Whole-genome sequencing of environmental samples is
producing data at an increasing pace. With the advent of
high-throughput next-generation sequencing (NGS)
technologies, a deeper insight into the phylogenetic and
functional composition of metagenomes has become
feasible. The research community needs robust
data analysis tools that allow efficient description of
composition, classification and clustering, coupled with
comprehensive visualization of results, while providing
means for comparative analysis within the context of all
accumulated metagenomic data for the same type of
environment. A number of existing web services
(including CAMERA [1], IMG/M [2], MG-RAST [3],
METAGENassist [4] and others) and stand-alone
applications (including QIIME [5] and SmashCommunity [6])
integrate data visualization and statistical analysis
functionality with databases of publicly available
metagenomic data, allowing users to compare their own
samples with those of other researchers. However, the
number of pre-loaded human gut metagenomic samples
in the repertoire of these tools is limited.

The human gut microbiome is one of the most extensively
studied subjects in metagenomic research. It is of
particular interest to scientists because of its significant
role in host health status. Representative reference
genomes for many taxa have been sequenced, and a
catalogue of prevalent gut microbial genes has already been
established [7]. MALINA exploits this accumulated
knowledge in the form of reference sequence sets to
provide a means for analyzing human gut whole-genome
reads within the context of publicly available metagenomic
datasets from around the world. The inclusion of a large
set of existing human gut metagenomic datasets allows
users to check which datasets are most similar to their
own data and, where metadata are present, to examine
them and pose hypotheses based on the similarities.
Features of MALINA and of existing software for analysis
of human gut whole-genome metagenomic reads are
compared in Table 1.

Table 1 Comparison of MALINA and existing software allowing analysis of human gut metagenomic reads

| Name of software | Type | Number of included shotgun human gut microbiome samples | Input: 454 | Input: Illumina, Sanger, SOLiD | Analysis: taxonomic profiling | Analysis: functional profiling, taxa co-occurrence analysis | Analysis: sample clustering | Analysis: group comparison | Visualization: abundance plots | Visualization: PCA, BCA, MDS | Visualization: hierarchical clustering |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CAMERA | web-service | 36 | + | + + | + | + | | | + | | |
| IMG/M | web-service | 149 | + | + + | + | + | + | | + | | + |
| MG-RAST | web-service | 13 | + | + + | + | + | + | + | + | + | + |
| MALINA | web-service | 357 | + | + + + | + | + + | + | + | + | + + + | + |
| METAGENassist | web-service | 39 | | | | + | + | + | + | + - + | + |
| QIIME | stand-alone | 0 | + | + + | + | | + | + | + | + - + | + |
| SmashCommunity | stand-alone | 0 | + | + | + | + | + | | + | | + |

Note: although METAGENassist does not accept read sequences as input, it does accept precomputed feature matrices that can be obtained from reads of any of the mentioned sequencing technologies with the help
of third-party software.

Implementation 

The MALINA workflow is shown in Figure 1. As input,
MALINA accepts nucleotide reads with a length of
at least 35 bp. Color-space (SOLiD) reads, as well as
long reads (such as Sanger and 454), are supported. To our
knowledge, MALINA is the first metagenomic analysis
web service supporting SOLiD color-space reads, which is
beneficial given the increasing volume of metagenomic
data sequenced using this technology. Files
with reads are uploaded by FTP. Through the web
interface, the user creates groups of samples, with each
sample including one or more read sets. The files for
a given sample are then associated with the appropriate
read sets and prepared for analysis.

The MALINA analysis pipeline characterizes metagenomic
composition in two ways, phylogenetic and functional,
by assessing the relative abundance of microbial
genera and genes, respectively. In the case of genes,
the total metabolic potential of all microbes is described.
The quantitative profiling is based on alignment of reads
to the reference genomes and the gene catalogue. The genome
catalogue contains more than 440 genomes of human
gastrointestinal bacteria obtained from HMP, NCBI and
relevant studies of the human gut microbiome. The gene
catalogue of prevalent human gut microbial genes
discovered by the MetaHIT project consists of 3.3 million
genes. After the reads are aligned to a reference set, the
resulting position-wise coverage of each sequence is
normalized by the sequence length and the total number of
reads in the read set. Summed over genera (for genomes) or
functional groups (clusters of orthologous groups, COGs [8],
for genes), it yields the relative abundance of phylogenetic
and functional units. For functional profiling, the COG
annotation from the MetaHIT gene catalogue is used. Each
metagenomic read set is thus described by two feature vectors.
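
The normalization and aggregation step described above can be illustrated with a minimal Python sketch. It is not MALINA's actual code: the function name, the dictionary layout and the toy numbers are assumptions made for illustration, and any final rescaling (for example to percentages) is left out because the text does not specify one.

```python
from collections import defaultdict


def group_abundance(total_depth, lengths, groups, total_reads):
    """Aggregate normalized coverage by group (genus or COG).

    total_depth: {sequence_id: coverage depth summed over all positions}
    lengths:     {sequence_id: reference sequence length in bp}
    groups:      {sequence_id: genus or COG label}
    total_reads: number of reads in the read set
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    abundance = defaultdict(float)
    for seq_id, depth in total_depth.items():
        # Normalize by sequence length and by read-set size.
        normalized = depth / (lengths[seq_id] * total_reads)
        # Sum the normalized coverage over all sequences of the group.
        abundance[groups[seq_id]] += normalized
    return dict(abundance)


# Toy read set: three reference genomes from two genera (made-up numbers).
depth = {"genome_a": 4000, "genome_b": 1000, "genome_c": 6000}
length = {"genome_a": 2000, "genome_b": 1000, "genome_c": 3000}
genus = {
    "genome_a": "Bacteroides",
    "genome_b": "Bacteroides",
    "genome_c": "Prevotella",
}
profile = group_abundance(depth, length, genus, total_reads=100)
print(profile)
```

Running the same function with a gene-to-COG mapping instead of a genome-to-genus mapping would give the second, functional feature vector of the read set.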

Feature vectors of the read sets selected by the user
are subject to statistical analysis and visualization:
boxplots of the most abundant genera/COGs, principal
components analysis (PCA), clustering (partitioning
around medoids [PAM] and hierarchical clustering) [9],
multidimensional scaling (MDS) [10] and between-class
analysis (BCA) [11] for the results of clustering. The PCA
plot shows a 2D projection of the feature vectors along the
directions of maximum variance in the data, with arrows
showing the genera ("drivers") that contribute most strongly
to variation between samples. For COG groups, PCA
plots are constructed separately for several functional
classes: antibiotic resistance (COGs collected from
the ARDB database [12]), transcription factors (COGs
selected from the total COG list according to description,
i.e. "transcription regulator/factor/repressor") and vitamin
metabolism (COGs from KEGG [13] vitamin synthesis
pathways). PAM clustering calculates the optimal number
of clusters and assigns the samples to the clusters. BCA is
a special case of PCA with respect to an instrumental
variable (represented here by the cluster number) and
produces a plot that highlights differences between the
clusters. The second implemented clustering algorithm,
hierarchical clustering, produces a dendrogram heatmap of
abundance. Sample visualizations produced by MALINA are
shown in Figure 2. Moreover, the statistical analysis includes
detection of genera and gene categories that discriminate
between the clusters using the Mann-Whitney test [14] and
the Random Forests algorithm [15], as well as taxa
co-occurrence analysis based on correlation of the
abundance values.

Figure 1 MALINA workflow. Input data and the main stages of analysis are illustrated: groups of samples (each sample comprising one or more read sets); read alignment to reference genes and genomes; coverage normalization and aggregation by genus/COG; statistical analysis and visualization (PCA, BCA, MDS, clustering, group comparison).

Figure 2 Examples of visualizations by MALINA. Metagenomic samples are denoted by text tags. (A) BCA plot of phylogenetic abundance,
with the clusters visualized as ellipses and the samples connected with the centers of their respective clusters by lines. (B) PCA of COGs
associated with vitamin metabolism, with drivers shown by red arrows. (C) Dendrogram heatmap of phylogenetic abundance based on
hierarchical clustering algorithm (with samples as columns and genera as rows).
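
To make the cluster comparison concrete, the following Python sketch runs a two-sided Mann-Whitney test on the abundance of one genus in two clusters. MALINA's statistics are implemented in R, so this is only an illustration: the function names and abundance values are invented, and the p-value uses the plain normal approximation without tie or continuity correction, which is rough for samples this small.

```python
import math


def midranks(values):
    """Return 1-based ranks; tied values get the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U statistic of x, approximate two-sided p-value)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_value


# Hypothetical relative abundance of one genus in two clusters of samples.
cluster_1 = [0.30, 0.28, 0.35, 0.40, 0.33]
cluster_2 = [0.05, 0.10, 0.08, 0.12, 0.07]
u_stat, p_val = mann_whitney_u(cluster_1, cluster_2)
print(u_stat, p_val)
```

Applying such a test to every genus or COG and ranking by p-value gives a list of candidate discriminatory features. With many features tested at once, a multiple-testing correction would normally follow, although the text does not say whether MALINA applies one.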

The plots can be downloaded as PDF files, and the
relative abundance, clustering results and other output
tables can be downloaded as tabulated text files. All
features of MALINA are available without registration, via
a guest account. The user can register a dedicated account
for free to keep uploaded data and results private and
to receive "analysis complete" notifications by e-mail.

An important feature of MALINA is that the user's
own data can be co-analyzed with
cohorts from large existing human gut metagenomic
studies: 85 Illumina samples and 37 Sanger samples
from the MetaHIT study [7], 139 Illumina samples
from HMPDACC and 96 SOLiD samples from a new,
previously unpublished Russian metagenomic study
(357 samples in total). The clustering functionality is
thus of particular interest to researchers exploring human
gut microbiota in relation to the concept of enterotypes [16]
across a large number of samples. A stand-alone analysis
of the pre-loaded datasets is also available.
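
One simple way to picture co-analysis with the pre-loaded cohorts is to rank reference samples by their dissimilarity to a user sample. The Python sketch below does this with the Bray-Curtis dissimilarity. The choice of measure, the function names, the sample names and the abundance values are all assumptions made for illustration, since the text does not state which distance MALINA uses.

```python
def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two abundance profiles.

    p, q: {taxon: relative abundance}; missing taxa count as zero.
    """
    taxa = set(p) | set(q)
    diff = sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in taxa)
    total = sum(p.get(t, 0.0) + q.get(t, 0.0) for t in taxa)
    return diff / total if total else 0.0


def nearest_samples(user_profile, reference_profiles, k=3):
    """Return the k (dissimilarity, name) pairs closest to the user sample."""
    scored = [
        (bray_curtis(user_profile, profile), name)
        for name, profile in reference_profiles.items()
    ]
    return sorted(scored)[:k]


# Hypothetical genus-level profiles (made-up values).
user = {"Bacteroides": 0.6, "Prevotella": 0.1, "Ruminococcus": 0.3}
references = {
    "ref_A": {"Bacteroides": 0.5, "Prevotella": 0.2, "Ruminococcus": 0.3},
    "ref_B": {"Prevotella": 0.8, "Ruminococcus": 0.2},
    "ref_C": {"Bacteroides": 0.6, "Prevotella": 0.1, "Ruminococcus": 0.3},
}
nearest = nearest_samples(user, references, k=2)
print(nearest)
```

The metadata of the closest reference samples could then be inspected to form hypotheses about the user sample.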

The user interface is implemented using the Ext JS
framework. Read alignment is performed using Bowtie [17]. In
the interest of performance, MALINA does not filter
reads by raw quality score, as experience showed
that filtering does not significantly increase the fraction
of mapped reads. However, such preprocessing can be
performed manually by the user. Coverage statistics are
calculated using BEDtools [18]. Statistical analysis is
implemented in R [19] using the ade4 [11], cluster [20],
ecodist [21], fpc [22] and randomForest [23] packages. The
pipeline steps are integrated using an Oracle database,
the Microsoft .NET framework and Python.
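
The text does not say which BEDtools command MALINA runs. As a hedged illustration, the Python sketch below assumes that per-base depth is available as tab-separated records (sequence name, 1-based position, depth), the layout that `bedtools genomecov -d` reports, and summarizes it per reference sequence. The function name and the example records are hypothetical.

```python
from collections import defaultdict


def summarize_depth(lines):
    """Summarize per-base depth records per reference sequence.

    Each record is 'sequence<TAB>position<TAB>depth'.
    Returns {sequence_id: (summed depth, positions seen, mean depth)}.
    """
    total = defaultdict(int)
    positions = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        seq_id, _position, depth = line.split("\t")
        total[seq_id] += int(depth)
        positions[seq_id] += 1
    return {
        seq_id: (total[seq_id], positions[seq_id], total[seq_id] / positions[seq_id])
        for seq_id in total
    }


# Hypothetical per-base depth records for two short reference genes.
example = [
    "gene_1\t1\t2",
    "gene_1\t2\t4",
    "gene_1\t3\t0",
    "gene_2\t1\t5",
    "gene_2\t2\t5",
]
stats = summarize_depth(example)
print(stats)
```

Because every position of a sequence is reported, including those with zero depth, the number of records per sequence equals its length, so the mean depth is already normalized by sequence length.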



Conclusions 

MALINA provides an easy and intuitive way to infer
metagenomic composition from reads and to analyze the
similarity of samples and their organization into clusters
within the global context of human gut metagenomic datasets.
Its features include statistical analysis methods such as
clustering and group comparison, as well as illustrative
visualizations of phylogenetic and functional composition. The
support for color-space SOLiD reads is a unique feature
that makes MALINA a particularly valuable service for
the growing community of researchers using SOLiD
technology for metagenomic analysis.

The reference gene catalogue used in MALINA will be
updated regularly as new versions of the MetaHIT data
become available. In the future, additional detailed
metadata will be associated with the samples,
allowing the user to check whether newly sequenced
samples are similar to certain groups distinguished by
medical, ethno-geographic or dietary factors. Further
development of the web service will include updates of
the human gut sample database with data from the Russian
population as well as from other new studies. Support for
diverse types of environment profiles besides the human gut
and additional methods for statistical analysis and
visualization will be added.



Availability and requirements 

Project name: MALINA 

Project page: http://malina.metagenome.ru 

Operating system: platform independent web site 

Programming languages: Microsoft C# .NET, JavaScript, R, Python

Other requirements: None 
License: FreeBSD 

Any restrictions to use by non-academics: None

Abbreviations

BCA: Between-Class Analysis; HMPDACC: Human Microbiome Project Data
Acquisition and Coordination Center; MDS: Multidimensional scaling;
PCA: Principal Components Analysis.